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Abstract 

Given i.i.d. data from an unknown distribution, we consider the problem 
of predicting future items. An adaptive way to estimate the probability den- 
sity is to recursively subdivide the domain to an appropriate data-dependent 
granularity. In Bayesian inference one assigns a data-independent prior prob- 
ability to "subdivide", which leads to a prior over infinite(ly many) trees. 
We derive an exact, fast, and simple inference algorithm for such a prior, for 
the data evidence, the predictive distribution, the effective model dimension, 
moments, and other quantities. We prove asymptotic convergence and consis- 
tency results, and illustrate the behavior of our model on some prototypical 
functions. 
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Figure 1: Bins versus Gaussian estimate of the true=data-generating probability 
density. (More decent diagrams will be made for the final version). 



1 Introduction 



Inference. We consider the problem of inference from i.i.d. data in particular 

of the unknown distribution q the data is sampled from. In case of a continuous 
domain this means inferring a probability density from data. Without structural 
assumption on g, this is hard to impossible, since a finite amount of data is never 
sufficient to uniquely select a density (model) from an infinite-dimensional space of 
densities (model class) . 

Methods. In parametric estimation one assumes that q belongs to a finite- 
dimensional family. The two-dimensional family of Gaussians characterized by mean 
and variance is prototypical (Figure 1). The maximum likelihood (ML) estimate of 
q is the distribution that maximizes the data likelihood. Maximum likelihood over- 
fits if the family is too large and especially if it is infinite- dimensional. A remedy 
is to penalize complex distributions by assigning a prior (2nd order) probability to 
the densities q. Maximizing the model posterior (MAP), which is proportional to 
likelihood times the prior, prevents overfitting. A full Bayesian procedure keeps the 
complete posterior for inference. Typically, summaries like the mean and variance 
of the posterior are reported. 

How to choose the prior? In finite or small compact low-dimensional spaces 
a uniform prior often works (MAP reduces to ML). In the non-parametric case 
one typically devises a hierarchy of finite-dimensional model classes of increasing 
dimension. Selecting the dimension with maximal posterior often works well due 
to the Bayes factor phenomenon [Goo83, Goo84, Jef35, Jay03, MacOS]: In case 
the true model is low- dimensional, higher-dimensional (complex) model classes are 
automatically penalized, since they contain fewer "good" models. In a full Bayesian 
treatment one would assign a prior probability (e.g. ^) to the dimension d and mix 
over the dimension. 

Interval Bins. The probably simplest and oldest model for an interval domain is 
to divide the interval (uniformly) into bins, assume a constant distribution within 
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each bin, and take a frequency estimate for the probabihty in each bin (Figure 1), 
or a Dirichlet posterior in Bayesian inference. There are heuristics for choosing the 
number of bins as a function of the data size. The simphcity and easy computabihty 
of the bin model is very appeahng to practitioners. Drawbacks are that distributions 
are discontinuous, its restriction to one dimension (or at most low dimension: curse 
of dimensionality), the uniform (or more generally fixed) discretization, and the 
heuristic choice of the number of bins. We present a full Bayesian solution to these 
problems, except for the non-continuity problem. Our model can be regarded as an 
extension of Polya trees [Fer73, Lav92, Lav94]. 

Related work. There are plenty of alternative Bayesian models that overcome 
some or all of the limitations. Examples are continuous Dirichlet process (mix- 
tures) [Per 73], Bernstein polynomials [PW02], Bayesian field theory [Lem03], ran- 
domized Polya trees [PRLW03], Bayesian bins with boundary averaging [EF05], 
Bayesian kernel density estimation or other mixture models [EW95], and univer- 
sal priors [HutOSb], but exact analytical solutions are infeasible. Markov Chain 
Monte Carlo sampling [Bis06] , Expectation Maximization algorithms [DLR77] , vari- 
ational methods [Bis06], efficient MAP or M(D)L approximations [KM07], or ker- 
nel density estimation [GM03] can often be used to obtain approximate numerical 
solutions, but computation time and/or global convergence remain critical issues. 
There are of course also plenty of non-Bayesian density estimators; see (references 
in) [KF98, BM98, LLW07] in general, and [KK97, KF98] for density tree estimation 
in particular. 

Our tree mixture model. The idea of the model class discussed in this paper is 
very simple: With some (e.g. equal) probability, we chose q either uniform or split the 
domain in two parts (of equal volume), and assign a prior to each part, recursively, 
i.e. in each part again either uniform or split. For finitely many splits, g is a piecewise 
constant function, for infinitely many splits it is virtually any distribution. While 
the prior over q is neutral about uniform versus split, we will see that the posterior 
favors a split if and only if the data clearly indicates non-uniformity. The method is 
a full Bayesian non-heuristic tree approach to adaptive binning for which we present 
a very simple and fast algorithm for computing all(?) quantities of interest. 

Note that we are not arguing that our model performs better in practice than 
the more advanced models above. The main distinguishing feature of our model is 
that it allows for a fast and exact analytical solution. It's likely use is as a building 
block in complex problems, where computation time and Bayesian integration are 
the major issues. In any case, if/since the Polya tree model deserves attention, also 
our model should. 

Contents. In Section 2 we introduce our model and compare it to Polya trees. We 
also discuss some example domains, like intervals, strings, volumes, and classification 
tasks. Section 3 derives recursions for the posterior and the data evidence. Section 
4 proves convergence/consistency. In Section 5 we introduce further quantities of 
interest, including the effective model dimension, the tree size and height, the cell 
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volume, and moments, and present recursions for them. The proper case of infinite 
trees is discussed in Section 6, where we analytically solve the infinite recursion 
at the data separation level. Section 7 collects everything together and presents 
the algorithm. In Section 8 we numerically illustrate the behavior of our model on 
some prototypical functions. Section 9 contains a brief summary, conclusions, and 
outlook, including natural generalizations of our model. See [Hut07] for program 
code. 



2 The Tree Mixture Model 

Setup and basic quantities of interest. We are given i.i.d. data D = (x^,...,x"-) G 
r" of size n from domain T, e.g. TCJR'^, sampled from some unknown probability 
density g : F — > iR. Standard inference problems are to estimate q from D or to 
predict the next data item e F. By definition, the (objective or aleatoric) data 
likelihood density under model q is 

likelihood: p{D\q) = q{xi) • ... ■ q{xn) (1) 

Note that we consider sorted data, which avoids annoying multinomial coefficients. 
Otherwise this has no consequences. Results are independent of the order and 
depend on the counts only, as they should. A Bayesian assumes a (belief or 2"'^- 
order or epistemic or subjective) prior over models q in some model class Q: 

prior: p{q) with q & Q 

The data evidence is 

evidence: piD) = / p(D\q)p{q)dq (2) 

JQ 

Having the evidence, Bayes' famous rule allows to compute the (belief or 2"^'^-order 
or epistemic or subjective) posterior of q: 

posterior: p(m = (3) 

p{D) 

The predictive distribution, i.e. the conditional probability that next data item is 
^_^n+i^ given D, follows from the evidences of D and {D,x): 

T){ D X] 

predictive distribution: pix\D) = — ' (4) 

p[D) 

Since the posterior is a complex object, we need summaries like the expected q- 
probability of x and (co)variances. Fortunately they can also be reduced to compu- 
tation of evidences: 

E[q{x)\D\ := j q(x)p(q\D)dq = / q(x)?^^^^dq 

Jp{D,x\q)p{q)dq p{D,x) 

= W) = W = 



4 



where we used the formulas for the posterior, the hkehhood, the evidence, and the 
predictive distribution, in this order. Similarly for the covariance we obtain 

Coy[q{x)q{y)\D] 

= E[q{x)q{y)\D]-E[q{x)\D]-E[q{x)\D] 
= p{x,y\D) - p{x\D)p{y\D) 

We derive and discuss further summaries of q for our particular tree model, like the 
model complexity or effective dimension, the tree height or cell size, and moments 
later. 

Hierarchical tree partitioning. So far everything has been fairly general. We 
now introduce the tree representation of domain F. We partition F into Fq and 
Fi, i.e. r = FoUFi and FonFi=0. Recursively we (sub)partition rz = TzoiJTzi for 
z e iB^, where iB^ := Ui^fc{0,l}* is the set of all binary strings of length between 
k and m, and Fg^F, where e = {0,l}° is the empty string. We are interested in 
an infinite recursion, but for convenience we assume a finite tree height m<oo and 
consider m— >oo later. Also let l:=i{z) be the length of string z = zi...zi = :zi:i, and 
IF^I the volume or length or cardinality of F^. 

Example spaces (Figures 2 &; 3). Intervals: Assume F= [0,1) is the unit interval, 
recursively bisected into intervals F^ = [0.z,Q.z+2~'-) of length [F^l = 2"', where O.z 
is the real number in [0,1) with binary expansion Zi...zi. 

Strings: Assume Tz = {zy:yE{0,l}"^~''} is the set of strings of length m starting 
with z. Then F = {0,1}'" and jF^I — 2"^~K For m — oo this set is continuous, for 
m<oo finite. 

Trees: Let F be a complete binary tree of height m and F^q (r^i) be the left 
(right) subtree of F^. If jF^j is defined as one more than the number of nodes in F^, 
then |F^| = 2"^+i-'. 

Volumes: Consider TcM'^, e.g. the hypercube F = [0,1)'^. We recursively halve 
F2 with a hypcrplanc orthogonal to dimension (/ mod d) + l, i.e. we sweep through 
all orthogonal directions. |F2| = 2^'|F|. 

Compactification: We can compactify FC (l,oo] (this includes F = W\{1}) to 
the unit interval F' := {- : 2; G F} C [0,1), and similarly F C iR (this includes F = 
to F' : = G [0,1) : ^^^Zl-^ ^ T}. All reasonable spaces can be reduced to one of the 
spaces described above, although this reduction may introduce unwanted artifacts. 

Classification: Consider an observation o G F' (e.g. email) that is classified as 
cG {0,1} (e.g. good versus spam), where F' could be one of the spaces above (e.g. o 
is a sequence of binary features in decreasing order of importance). Then x:—{o,c) G 
F:=F'x{0,l} and Fo2 = F;x{0} and Fi^ = F;x{1}. Given D (e.g. pre- classified 
emails), a new observation is classified as c with probability p{c\D,o) (xp{D,x). 
Similar for more than two classes. 

In all these examples we have (chosen) jF^ol = jF^ij = ^jF^j ^zElBj^'^, and this 
is the only property we need and henceforth assume. W.l.g. we assume/define/ 
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rescale |r| = 1. Generalizations to non-binary and non-symmetric partitions are 
straightforward and briefly discussed at the end. 

Identification. We assume that {F^ : z e ^o*} are (basis) events that generate 

our (T-algebra. For every a; G F let x' be the string of length iix'^ = m such that 
x^Tx'- We assume that distributions q are ci-measurable, i.e. to be constant on 
Fa;/ Wx' G ]B"\ For m = oo this assumption is vacuous; we get all Borel measures. 
Hence, we can identify the continuous sample space F with the (for m < oo discrete) 
space IB^ of binary sequences of length m, i.e. in a sense all example spaces are 
isomorphic. While we have the volume model in mind for real-world applications, 
the string model will be convenient for mathematical notation, the tree metaphor 
will be convenient in discussion, and the interval model will be easiest to implement 
and to present graphically. 

Notation. As described above, F may also be a tree. This interpretation suggests 
the following scheme for deflning the probabihty of q on the leaves x'. The probabihty 
of the left child node zO, given we are in the parent node z, is P[F2;o|F2,gf], so we 
have 

p{x\Tz,q) = p{x\T^o,q)-P[T^o\^^,q] if a; G F^o 

and similarly for the right child. In the following we often have to consider distribu- 
tions conditioned to and in the subtree F^, so the following notation will turn out 
convenient 

q^o := P[T^o\T^,q\, q^i := P[T^i\T^,q\, Pz{x\...) := 2"'p(a;|F^...) (5) 

m 

=^ Pz{x\q) = ^q^xt+iPzxi+iixlq) = ■■■ = H '^^^i-.i if ^ ^ F^ 

1=1+1 

where p(a;|Fa.',g) := |Fa,'|~^ = 2'" is uniform (by assumption). Note that qzo + qzi = ^- 
Finally, let 

qz* := {qzy : y e B^-') 

be the (2"*"'+-'^ — 2)-dimensional vector or ordered set or tree of all reals qzy& [0,1] in 
subtree F^. Note that qz^qz*- The (non) density qz{x) '-—PziAq) depends on all and 
only these qzy. For z^e, qz{) and Pz{) are only proportional to a density due to 
the factor which has been introduced to make Px'{x\...) = 1. (They are densities 
w.r.t. 2'A|r^, where A is the Lebesgue measure.) We have to keep this in mind in 
our derivations, but can ignore this widely in our discussion. 

Polya trees. In the Polya tree model one assumes that the qzo = ^ — qzi are inde- 
pendent and Beta(-,-) distributed, which defines the prior over q. Polya trees form a 
conjugate prior class, since the posterior is also a Polya tree, with empirical counts 
added to the Beta parameters. If the same Beta is chosen in each node, the posterior 
of X is pathological for m— s>oo: The density does nowhere exist with probability 1. 
A cure is to increase the Beta parameters with /, e.g. quadratically, but this results 
in "underfitting" for large sample sizes, since Beta(large,large) is too informative 
and strongly favors q^o near i. It also violates scale invariance, which should ideally 
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hold if we do not have any prior knowledge about the scale. That is, the p(oste)rior 
in ro = [0,|) should be the same as for r = [0,l) (after rescaling all X'^xjl in D). 

The new tree mixture model. The prior •pi^q) follows from specifying a prior 
over g*, since qix^oiq^-^- ...-q^^.^ by (5). The distribution in each subset F^CF shall 
be either uniform or non-uniform. A necessary (but not sufficient) condition for 
uniformity is g^o = (lz\ = \- 

P''(qz0,qzi) := S(q^o-l)Siqzi-l), (6) 

where 6() is the Dirac delta. To get uniformity on F^ we have to rccursc the tree 
down in this way. 

Pz(Qz*) := p"(?zo,?zi)p"o(?20*)p"i(?2i*) (7) 

with the natural recursion termination ^"(gj:*) = 1 when i{z)—'m, since then g^* = 0. 
For a non-uniform distribution on F^ wc allow any probability split ^(F^) = g(F2o) + 
g(F^i), or equivalently 1 = Qzo + Qzi- Wc assume a Beta prior on the split. Scale 
invariance requires the Beta parameters to be the same in all nodes of the tree and 
symmetry demands a symmetric Beta, i.e. 

P%qzo,qzi) ■= Bcta{qzo,qzi\a,a), (8) 
Beta(p,g|a,/3) I^p-^q^-^S{p+q-l) (9) 

where T{a) = J^f^'^e'^dt is the Gamma function. For a—1 this specializes to the 
natural uniform prior p^{qzo,qzi) — SiQzo+Qzi — i) on the split. We now recurse down 
the tree 

pHqz*) p'{qzo:qzi)Pzo{qzo*)Pzi{qzi*) (lo) 

again with the natural recursion termination Pz{qz*) —p{0) — 1 when i{z) —m. Finally 
we have to mix the uniform with the non-uniform case. 

Pz{qz*) u-p'^iqz^) + s-p^q^^) (11) 

with ■u,sG [0,1] and u+s = l. The 50/50 mixture u = s = ^ will be of special interest. 
This completes the specification of the prior p{q) =Pe{(l*)-^ 

For example, if the first bit in x is a class label and the remaining are binary 
features in decreasing order of importance, then given class and features z = xia, 
further features x^+i^m could be relevant for classification {qz{x) is non- uniform) or 
irrelevant {qz{x) is uniform). 

Comparison to the Polya tree. Note the important difference in the recursions 
(7) and (10). Once we decided on a uniform distribution (6) wc have to equally split 
probabilities down the recursion to the end, i.e. we recurse in (7) with p", rather 
than the mixture p (this actually allows to solve the recursion) . On the other hand if 

"'^Note that Pzisiz*) is not the marginal oip{q) to qz*, but one can show that Pz{<lz*)=p{<lz*\qzi 7^ 
^,...,qz^.i and optionally additional conditions on some or all q^qz*. 
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we decided on a non-uniform split (8), the left and right partition each itself may be 
uniform or not, i.e. wc recurse in (10) with the mixture p, rather than p^. Inserting 
(8) in (10) in (11) and recursively (6) in (7) in (11) we get the following recursion 
for the prior 



Choosing u = would lead to the Polya tree model (and its problems) with Qzq ~ 
Beta(-|Q!,Q!). With p instead of p^ on the r.h.s. of (7) we would get a quasi-Polya 
model (same problems) with g2;o~it-Beta(-|oo,oo)-|-s-Beta(-|Q;,Q;). 

For m— >oo, our model is "scale" invariant and leads to continuous densities for 
n— >-cxo, unlike the Polya tree model. We also don't have to tune Beta parameters. We 
can use a non-informative prior like a = l and u = s = ^. The model "tunes itself" by 
suitably assigning high/low posterior probability to subdividing cells. While Polya 
trees form a natural conjugate prior class, our prior does not directly, but may be 
generalized to do so. The computational complexity for the quantities of interest 
will be the same (essentially 0(n)), i.e. as good as it could be. 

Formal and effective dimension. Formally our model is 2- (2™ — l)-dimensional, 
but the effective dimension can by much smaller, since is forced with a non-zero 
probability to a much smaller polytope, for instance with probability u to the zero- 
dimensional globally uniform distribution. We will compute the effective p(oste)rior 
dimension. Alternatively, we could have considered a mixture over all (= lower 
dimensional) partial trees with P^ as leaf if q is uniform on P^, but considering one 
complete tree is more convenient for analytical manipulation. 

3 Evidence and Posterior Recursion 

At the end of Section 2 we defined our tree mixture model. The next step is to 
compute the standard quantities of interest defined at the beginning of Section 
2. The evidence (2) is key, the other quantities (posterior, predictive distribution, 
expected q{x) and its variance) follow then immediately. Let 



be the riz '■— \Dz\ data points that lie in subtree P^. We compute Pz{Dz) recursively 
for all ze1B^~^, which gives p{D)—p^{D^). 

Theorem 1 (Evidence recursion) For £{z) < m the recursion for the evidence is 



PziQz*) = u Jl ^iQzy-l) + s-Beta{qzo,qzi\o!,a)pzQ{qzo*)Pzi{qzi*) 



(12) 



Dz := {xeD:xe P^} 



w{nzo,nzi) 



Pz{Dz) 



Pzo{Dzo)Pzi{Dzi) 

I + S r 

WyUzO, Uzl) 

2-^--P(n^ + 2a) T{af 



(13) 



(14) 
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The recursion terminates with Pz{Dz) = l when £{z)—m. 



The recursion (13) follows by multiplying (16) in Theorem 2 (stated and proven 
below) with Pz{Dz) and adding u. For a=l, (13) and in particular the weight — 
2~"^(n2 + l)(^''^) can be interpreted as follows: With probabihty u, the evidence 
is uniform in F^. Otherwise data is split into two partitions of size n^o and 
'>T'zi='nz—nzo. First, choose Uzo uniformly in {0,...,nz}- Second, given n^, choose 
uniformly among the {^""^J possibilities of selecting n^o out of data points for F^q 
(the remaining Hzi are then in F^i). Third, distribute D^o according to Pzo{Dzo) and 
Dzi according to Pzi{Dzi)- Then, the evidence in case of a split is the second term 
in (13). The factor 2"^ is due to our normahzation convention (5). This also verifies 
that the r.h.s. yields the l.h.s. if integrated over all Dz, as it should be. For n^— >^oo 
we will show in Section 4 that — > cxo if n^o ~ '^zi ~^ oo and otherwise, 

indicating that the weight w is large (small) for (non)uniform distribution, as it 
should be. 

Theorem 2 (Posterior recursion) For £{z) < m the recursion for the posterior is 
Pz{qzADz) = —p-. \[K<lzy-\) (15) 

+ gz{Dz) Beta(g20 , 9^1 k^o + « , "-zi + a)P20 ( Qzo* I -D^o)P2i (?zi* I -D^i ) 

/n ^ Pzo{Dzo)Pzi{Dzi) T u 

9z{Dz) := s — r = 1 -—T 16 

Pz[Dz)w{nzo,nzi) Pz{Dz) 

The recursion terminates with Pz{qz*\Dz) = i when £{z)—m. 

gz{Dz) may be interpreted as the posterior probability of splitting F^. 
Proof. Using Bayes rule (3) we represent the posterior as 

Pz{qz*\Dz)pz{Dz) = Pz{Dz\qz*)Pz{qz*) (17) 
and further substitute Pz{qz*)—upl{qz*)-\-sp%{qz*) (H)- For the uniform part we get 
Pz{Dz\qz*)-Pziqz*) = n (25^w+r----2fe:r„)-n^(9^2/ - I) 

= n%^.-i)' (18) 

where we recursively inserted (6) in (7), and (1) and (5) into (18). Due to the 5, we 
can simply set all qzy — ^- For the spht we get 

Pz{Dz\qz*)-pl{qz*) 

= {jl'2qzo)Pzo{Dzo\qzo*)(jl2qz^Pzi{Dzi\qzi*) (19) 
xeDzo xeDzi 
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'y^VzQ{MzQA^zQ)v{DzQ)Vz\{^z\A^z\)v{Dz\) 

= ^gz{D^)p^{D^)Beta{qzo,qzi\n^o + a,nzi + a) (21) 
xPzoiQzO* I Dzo)pzi {q^u I D^i ) 

In (19) we split into and D^i and used (1) and (5) and the fact that 
Pzo{Dzo\qz*) depends on q through g^o* only. We also inserted (9) in (8) in (10) 
in (19) and used rizo+nzi — riz- Rearranging terms and using Bayes rule (17) for 
subtrees Tzo and F^i we get (20). The last equality is easiest proven backwards by 
inserting (16) and w (14) and Beta (9) into (21). Inserting (11) and (18) and 
(19)-(21) into (17) and dividing by Pz{Dz) yields (15). 

Integrating (15) over qz* and noting that jdqz* = Jdqzodqzi- Jdqzo*- Jdqzu factor- 
izes and that n<^() and Beta() and the qz*, qzo* and qzi* posteriors are all proper 
densities which integrate to 1, we get 

Pz{Dz) 

This shows the last equality in (16). Theorem 1 (13) now follows by multiplying 
(16) with pz{Dz) and adding u. 

For a formal proof of the recursion termination, recall (5): For t[z) —m and 
xeFz we have F^/ = F^ ^ p^{x\q) = l ^ Pz{Dz\q) = l ^ Pz{Dz) = l. □ 



4 Asymptotic Convergence / Consistency (n oo) 

Discussing the weight. The relative probability of splitting (second term on r.h.s. 
of (13)) to the uniform case (first term in r.h.s. of (13)) is controlled by the weight 
w. Large (small) weight indicates a (non)uniform distribution, provided Pzo and Pzi 
arc 0(1). The balance A^^O (t^O) indicates a (non) symmetric partitioning of the 
data among the left and right branch of F^^. Asymptotically for large Uz (and small 
A^), we have 



^^^(A^) ^ c, 




where c„ >0 is some finite constant. Assume that data D is sampled from the true 
distribution q. The probability of the left branch F^o of F^ is qzo = P[Fzo\Fz,Q] — 
2''qz{Tzo). The relative frequencies ^ asymptotically converge to qzo- More precisely 
^ = qzo^O{nJ^^^). Similarly for the right branch. Assume the probabilities are 
equal {qzo = 'izi = \), possibly but not necessarily due to a uniform g^() on F^. Then 
Az = 0{n~^^'^), which implies 

'tJ^nA^z) ~ ^iV^z) oo if qzo = qzi = |, 
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consistent with our anticipation. Conversely, for q^o 7^ Qzi (which imphes non- 
uniformity of QzO) we have A^— >c:=^20~|7^0, which imphes 

w^^{A,) ^ ^ e-'^''^' if Qzo^qzi, 

again, consistent with our anticipation. Formally, the following can be proven: 
Theorem 3 (Weight asymptotics) Forqzo = ^ we have with probability 1 (w.p.l) 

i) hm — — § Wn (A,) = 00, and 

ii) limsupW — w^JA^) = Cq 
where CQ, = 4"~^r(Q;)^/r(2Q;). For g^oT^I have w.p.l 

Hi) lim e^"^"'w;n,(A^) = V |c| < |g^o-|| 

rij— >oo ^ V / zi 

Proof. We will drop the index z everywhere. We need an asymptotic representation 
of w for no,ni— >oo. Using Stirling's approximation lar{x) — {x—^)lax—x+^la{2n) + 
O(^) we get after some algebra 

ln«;„(A) = -n[i/(i)-i/(i±A)] + iln^ (22) 

+ (2a-l)if(|±A) + C„ + 0(^ + ^), 
Hip) = — plnp — (1— p) ln(l— p) = Entropy(|j), 

A — A — ^("0""!) A _ A — HO _ i 

^ n+2a-l^"- n+2o-l ) ^ „ 2' 

C« = 21nr(a) -lnr(2a) 
(i) follows from the law of the iterated logarithm 

\Xi + ... + Xri-n/j] 

hmsup . = 1 w.p.l 

n-^oo ay n In In n 

for i.i.d. random variables with mean // and variance . For the i*'* 

data item in £), let X^ — l if x e and X^ — ^ if x^D^. Then the are i.i.d. 
Bernoulli(^o) with n — qQ — \ and cr^ = ^o^i = |- Further, = no implies 

— n// = nA„ implies lim sup^y^j^^j A„| = 1 w.p.l. implies 

A^ < (1+e)^ w.p.l 

for all sufficiently large n and any £ > 0. Using A = A-|-0(i), (22) can be further 
approximated by 

ln«;„(A) = -n\E% - E{\±t.)\ + | In ^ + 0(1) 
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A Taylor series expansion around A = yields 

H{^-H{\±^) = 2A2 + 0(A^) < (l+5)kUan+o((ii^)2) 

which implies 

lnw„(A„) - iln^ + lnlnn > i(l - Inlnn + 0(1) — >■ oo for e<l 

which implies {i) by exponentiation. 

(ii) (a) Convexity of Inr(x) implies that lnty„(A) is concave and symmetric in A, 
hence lnw„(A) assumes its global maximum at A = 0. (6) From (22) it follows that 
lnWn(0) = iln^ + (2a-l)if(|) + Cc. + C'(^). (c) {2n/^ri)n=i is a symmetric random 
walk, hence infinitely often passes zero w.p.l. (a) and (6) imply the < and (6) and 
(c) the > in limsup„[lnu'„(A„) — |ln^] = (2q; — I)ln2 + Ca w.p.l. Exponentiation 
yields {ii). 

(iii) Since A„^go-| w.p.l, (22) implies 2nc^ + \nwn-^n\2c^ ~ H{\)+H{%)]^ 
— oo, since i7(|)— if(go) >2(go — |)^>2c^. The asymptotic representation also holds 
for uq — Q or ni = 0, hence {iii) follows by exponentiation for all including and 
1. □ 



Asymptotics of the evidence p{D). The typical use of the posterior p{x\D) is 
as an estimate for the unknown true distribution q{x). This makes sense if p{x\D) is 
close to q{x). We show that the finite tree mixture model is indeed consistent in the 
sense that p{x\D) converges^ to q{x) and the posterior of g() concentrates around 
the true value g() for n^oo. 

Theorem 4 (Evidence asymptotics) Forfixedm<oo andriz^oo, the posterior 
Pz{x\Dz)^qz{x) for all xeF^. Furthermore, for the evidence w.p.l we have 



Pz{D,) 



u for uniform g^() and I < m, 
= 1 for I — m, 

— ^ oo for non-uniform qzi) provided s > 0. 



Proof by induction on /. We have to show slightly more, namely also that 
Pz{Dz,x) c G {■u,l,oo}. For l = m, the theorem is obvious, since qz{x) must be 
uniform on F^ and pz{Dz) = l=Pz{Dz,x), hence pz{x\Dz) = l=qz{x). Now assume 
the theorem holds for F^o and F^i and / < m. We show that it then also holds for 
F^. Assume u>0 first. 

(a) Assume first, that qz{) is uniform. This implies that also g^oO and qzi{) are 
uniform, hence n^c'n.^i — > oo, hence by induction hypothesis, Pzo{Dzo) and Pzi{Dzi) 
are bmmded. Further. (A-.) oo for — > oo (by Theorem 3(i) and {ii))- 

^AU statements hold with probability 1 (w.p.l). 
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Hence, p^iD^) u from (13). Similarly Pz{Dz,x) — > u, hence Pz{x\Dz) — > 1 = Qzix) 
for x^Tz- 

(b) We now consider the case of non-uniform QzO- (i) Consider the case that 
<izoO oi' QziO (or both) are non-uniform first. Pzo{Dzo)^'^>^ Pzi{Dzi)>u>0, 
and one of them diverges exponentially. Since grows at most with 0{y/n^) by 
Theorem 3{ii), we see from (13) that also Pz{Dz)^s-pzo{Dzo)Pzi{Dzi)/wn^ diverges 
exponentially, and similarly pz{Dz,x). {ii) If both g^o() and qzi{) are uniform, then 

7^ Ij since we assumed non-uniform g^O- This implies bounded Pzo{Dzo) and 
Pzi{Dzi)-, but exponentially diverging by Theorem 3{m). Hence, again, Pz{Dz) 
and similarly pz{Dz,x) diverge exponentially. In both cases, (i) and (ii), assuming 
w.l.g. x&Fzo, the ratio is 

Pz{x\D) ~ -^^.p^^{x\Dzo) = 2-^-pzo{x\Dzo) ~ 2-qzo-qzo{x) = g.(x) (23) 

See (5) for notation and how the density factor 2 disappears. For u = 0, (23) holds 
for any QzO- Further, pz(Dz) ^^oo still holds, since Pzo{Dzo) tends not faster than 
polynomially to zero by induction. □ 



Theorem 5 (Posterior consistency) The posterior of Qz^ concentrates for Uz 
oo around the true value q^^ w.p.L, i.e.^ 

Pz{qz*\Dz) "^=^ n Kflzy-^zy) (24) 

Proof. We prove consistency (24) by induction over /. For l{z) = m the l.h.s. and 
r.h.s. are formally 1, since a density over an empty space and an empty product 
are defined as 1. Assume that consistency (24) holds for ^{zQ) = i{zl) = 1 + 1. For 
n^— >oo, the Beta concentrates around >^^o and ^-g^i w.p.l: 

Bei&{qzQ,qzi\nzQ+a,nzi+a) 5{qzo-qzo)S{qzi-qzi) 
Inserting this and (24) for zO and zl into the r.h.s. of recursion (15) we get 

Pz{qz*\Dz) -777T n ^(^^y - I) + (1 - rTTTrJ 11 ^i^^y - ^^y) (25) 



For uniform q^^, i.e. \ = qzy yy & IB^~^ the r.h.s. reduces to the r.h.s. of (24). For 
non-uniform q^^, Theorem 4{iii) shows that Pz{Dz) ^ oo (exponentially), and the 
r.h.s. of (25) converges (rapidly) to the r.h.s. of (24). □ 



^The topology of weak convergence or convergence in distribution is used. 
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5 More Quantities of Interest 



In this section we introduce further quantities of interest and present recursions for 
them. They all can be written as expectations 

Ez[f{qz*)\D^] := j f{qz*)Pz{qzAD^)dq^^ (26) 

for suitable functions / (and similarly for P2[...|...] = P[...|r2...]). For instance for 
the evidence (13) we used f{qz*) = 1 and for the posterior (15) formally f{qz*) = 
YlyS{qzy — q'zy)- Below we consider the model dimension, cell number, tree height, 
cell size, and moments. The derivations of the recursion all follow the same scheme, 
inserting the (recursive) definition of / and the recursion (15) into the r.h.s. of (26), 
and rearranging and identifying terms. These details will be omitted. 

Model dimension and cell number. As discussed in Section 2, the effective 
dimension of is the number of components that are not forced to | by (6). Note 
that a component may be "accidentally" | in (8), but since this is an event of 
probability 0, we don't have to care about this subtlety. So the effective dimension 
^qz* — ^qz*'q¥'l} of qz* can be given recursively as 

_ if ^{z)=m or q,o ^ ^ , . 

- 1 1 + TV,- 0. + else 

The effective dimension is zero if qzo — \-, since this implies that the whole tree 
has qzy — \ due to (12). If g^T^I, we add the effective dimensions of subtrees F^o and 

Vzi to the root degree of freedom qzo = ^—qzi- We may be interested in the expected 
effective dimension E[N^^\D]. Inserting (27) {f{qz*)=N^_,^) and (15) into the r.h.s. 
of (26) we can prove the following recursion for the expected effective dimension 

Ez[N^^pz] = gz{D^[l+Ezo[N^^,pzo] + Ezi[N^^,pzi]] (28) 

Read: The expected dimension of if^* (l.h.s.) equals to 1 for the root degree of 
freedom plus the expected dimensions in the left and right subtrees, multiplied 
with the probability gz{Dz) of splitting F^ (r.h.s.). The recursion terminates with 
Ez[N^^ADz] = when l{z) —m. Higher (central) moments hke the variance can be 
computed similarly. One can also compute the whole distribution {P[N^^ —k\D])k^]jsr 
by convolution. Inserting 



\ X ■ if Z = m or g^o = |, .„„^ 

/fe.) . (29) 

and (15) into (26) we get 



P,[%, = 0\Dz] = 1 - gz{D,), for I < m, 

Pz[N,-.,=k + l\Dz] = (30) 

k 



i=Q 



P,[%, = 5,0 := {J :JJ=S} for I ^ m. 
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Read: The probability that tree has dimension k + 1 equals the probability of 
splitting, times the probability that left subtree has dimension i, times the proba- 
bility that right subtree has dimension k—i, summed over all possible i. Again, this 
follows from inserting (29) and (15) into (26). 

Let us define a cell or bin as a maximal volume on which g() is constant. Then 
the model dimension is 1 less than the number of bins (due to the probability 
constraint). Hence adding 1 to the above quantities we also have expressions for the 
expected number of cells and distribution. 

Tree height and cell size. The effective height of tree g^* at a; G is also an 
interesting property. If qzQ = \ or i{z)=m, then the height h(^_^^{x) of tree g^* at x 
is obviously zero. If g^oT^ |, we take the height of the subtree qzxi+i* that contains 
X and add 1: 



if 



(z) = m 
else 



or Qzo 



One can show that the tree height at x e F^ averaged over all trees Qz* is 



Elh,^Jx)\D,] = g,{D,)\l + E,,^Jh,^ {x)\D 



We may also want to compute the tree height averaged over all xeF^.. 

f if i{z) = m or q^o 



1 + q^Q-h 



qzo* 



qzi-h. 



qzi* 



(z) = m 
else 



1 + 



+ 



nzo+ OL 

riz + 2a 
nzi + a 

Uz + 2a 



Ezo[h^zo*\^zo] 

Ezl[hq^l^Dzl] 



with obvious interpretation: The expected height of a subtree is weighted by its 
relative importance, that is (an estimate of) its probability. The recursion terminates 
with Ez[h^^^\Dz] — when i{z) =m. We can also compute intra and inter tree height 
variances. 

Finally consider the average cell size or volume v. Maybe more useful is to con- 
sider the logarithm —\0g2\T z\= (!-{z) , since otherwise small volumes can get swamped 
in the expectation by a single large one. Log- volume v^^^=l{z) if l{z)=m or ^20 = |, 
and else recursively v^^^ = ^zo^^gio* + 5^i'^gzi*- We can reduce this to the tree height, 
since v^^^^h^^^^-l{z), in particular v^^^h^^ 

Moments in x. Assume data xeT— [0,1) are sampled from q{). The mean and 
variance of x w.r.t. q{) are important statistical quantities. More generally consider 



M. 



1 



9z* 



M{x)qz{x)dx 
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(jF7 / qz{x)dx — l). Since is itself random, it is natural to consider the p-expected 
g-mean (26) of M 

lE,[M{x)\D,] := E,[M,^JD,] = t^J' M{x)p,{x\D,)dx (31) 
Inserting recursion (13) for Pz{Dz,x) using (4) in (31) we get 



lE,[M{x)\D,] = ^-^ + g^(D,) -^^lE,o[M(x)\D,o] + ^^IE,,[M(x)\D,^] 

(32) 

again with obvious interpretation: The expectation of M is a mixture of a uniform 
expectation := jj^/p^ M(x)(i,T and the weighted average of expectations in left 
and right subtree. The recursion terminates with lEz[M{x)\Dz] = Mz when l = m. 
Examples: For M{x) = l, both sides of (32) evaluate to 1 as it should. For the A;th 
moment of x, M{x)^x^ we have M^ = [(z+l)'=+^-z'=+^]/[2'='(A;+l)], where ^ = 2'0.z 
is interpreted as an integer in binary representation. The distribution function 
Pz[x<a\D^] is obtained for M(a;) = { J jjg^. For ogF^ wc have Mz = 2^a-z. Since 

M[x) is constant on F^^a, we have iE'^[M(a;)|D2] = in this case, hence 

only one recursion in (32) needs to be expanded (since a^F^o or a^F^i). 



6 Infinite Trees (m— >oo) 



Motivation. We have chosen an (arbitrary) finite tree height m in our setup, 
needed to have a well-defined recursion start at the leaves of the trees. What we 
are really interested in are infinite trees {m — oo). Why not feel lucky with finite 
ml First, for continuous domain F (e.g. interval [0,1)), our tree model contains only 
piecewise constant models. The true distribution g() is typically non-constant and 
continuous (Beta, normal, ...). Such distributions are outside a finite tree model 
class (but inside the infinite model), and the posterior p{x\D) cannot converge to 
the true distribution, since it is also piecewise constant. Hence all other estimators 
based on the posterior are also not consistent. Second, a finite m violates scale 
invariance (a non- informative prior on F^ should be the same for all z, apart from 
scaling). Finally, having to choose the "right" m may be worrisome. 

For increasing m, the cells F-^ become smaller and will (normally) eventually 
contain either only a single data item, or be empty. It should not matter whether we 
further subdivide empty or singleton cells. So we expect inferences to be independent 
of m for sufficiently large m, or at least the limit m^oo to exist. In this section we 
show that this is essentially true, but with interesting exceptions. 

Prior inferences {D = 0). We first consider the prior (zero data) case D = 0. 
Recall that z G IB^ is some node and x G iB™ a leaf node. Normalization implies 
Pz{0) — 1 for all z, which is independent of m, hence the prior evidence exists for 
m— >oo (see below for a formal proof). This is nice, but hardly surprising. 
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The prior effective model dimension A^^; is more interesting. D—0 implies Dz—0 
implies Uz — O implies w{nzo,nzi) = l implies a chance gz{0) — l—u — s for a split (see 
(16)). The recursion (28) reduces to 

with Ez[N^^^] — for l = £{z)=m and is easily solved: 

r "l:^ ^ if s<- 

E.W.^J = .(1 + 2.(1 + ...)) = s = i(m-0-oo if . = i 

[ ^-^-oo if 

This can be understood as follows. Consider trees truncated at nodes with uniform 
distribution. Assume that there are k{l) nodes at height With probability u the 
node is a leaf, and with probability s it has two children. So the expected number 
of nodes at height l+l is k{l+l) = 2s-k{l). So the number of nodes exponentially in- 
creases/decreases with Z for s>|/s<|, which results in a infinite/finite total number 
of nodes (=dimension). The linear divergence for s = ^ looks alerting (overfitting) , 
but we now show that the distribution exists and an infinite expectation is actually 
a good sign. Recursion (30) reads 

k 

with Pz[Nq-^^ =0]=u for / < m and Pz[Nq-^^ = k] = Sko for I = m. So the recursion 
terminates in recursion depth min{A; + l,m — /}. Hence Pz[Nq-^^ = k + 1] is the same 
for all m>l + k, which implies that the limit m^oo exists. Furthermore, recursion 
and termination are independent of z, hence also -.— PzlN^^^ —k]. So we have to 
solve the recursion 

k 

Qk+i ^ s^tti-Qk-i with ao^u (33) 

i=0 

The first few coefficients can be bootstrapped by hand, e.g. for s = | we get (a = 
l'l'^'Ti8'2i6'Tii4'2ii8 ^ closed form can also be obtained: Inserting (33) into 
f (x) ■.=J2h=o'^kx'^^^ we get f{x)=ux+sp{x) with solution f{x) = ^[1 — \/l — 4smx], 
which has Taylor expansion coefficients 

^ ' \k + \j {k + l)A>'\k) 0r 

For s < \, {ak)k£iNo is a well-behaved properly normalized probability measure 
(Z]fcC^A: = /(l) = l<oo). For s<| it decreases exponentially in k, implying that all mo- 
ments exist and indicating strong bias towards simple models. For s = ^, ak^k~^^^ 
decreases polynomially in k, too slow for the expectation E[N^^] = J2kk-0'k — oo to ex- 
ist, but this is exactly how a proper non- informative prior on ]N should look like: as 
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uniform as possible, i.e. slowly decreasing. Further, P[N^^<oo]—J2k'^k — i shows that 
the effective dimension is almost surely finite, i.e. infinite (Polya) trees have probabil- 
ity zero for s< |. On the other hand, for s> |, we have P[Nq'^ = oo] = 1-/(1) = 2 — -^, 
i.e. a non-zero probability for infinite trees. The reason why also exponentially 
decreases in this case is that as deeper a tree grows as less likely it stays finite. These 
results are consistent with the expected model dimension. They indicate a proper 
behavior of our model for s < | and in particular for s = |. 

For the tree height we have Ez[h^^^{x)] — ii l — m and otherwise 

= s + s^ + ... + 3""-^ 

i s.wr!"^^^ if s<l, 

— J l—s 1—s ' 

1 7 m— >0O -r -, 

m — l — > oo it S = 1, 

i.e. the prior expected height is large/small if the splitting probability s is 
large/small. The same holds for the expected average height Ez[h^^^] -^jz^- This 
is the first case where the result is not independent of m for large finite m, but it 
converges for m— >oo, what is enough for our purpose. Note that a finite expected 
tree height even for 1 > s > | is consistent with an infinite model dimension, since 
h — oo only for a vanishing fraction of tree branches xgF, i.e. for a set of measure 
zero. 

The prior moments M{x) are easy to compute: Since Pz{x) = 1, we get 
E,[M{x)]=M,. 

Multi-points D = (x^,...,x^). The next situation we analyze is multiple points 
D—{x^,...,x^), where all data points are identical. For continuous spaces and non- 
singular priors, the probability of such an event is zero, so this scenario does not seem 
particulary interesting, but: When computing posteriors, x is not chosen randomly 
but deliberately, so in computing p{D,x), x could be equal to x^ (although again 
only with probability zero). When computing higher moments we need p{D,x,...,x) 
and definitely encounter multiple points. Also, multi-points help to understand the 
case when x or x^ comes very close to x^. Also the true prior may be singular causing 
x^ = x'^ with non-zero probability. Finally, the multi-point case includes n — 1, which 
we have to analyze in any case. 

is either oi D. -D^ = has been studied in the last §, so we assume Dz = D. 
Either D^q = or D^i — 0. W.l.g. we assume the latter. Then D^q — D, n^o = n, 
— which implies '^^(^zc^zi) =2"""^|^^^^^. Defining w:—s/w, recursion (13) 
reduces to 

I _ ^m-l 

p,{D) = u + w-p,o{D) = = u— — + 

1 — w 

= 1 if w = s 

if w< 1 

1—w 

= u{m — l) + 1 oo if w — 1 

_^ M^^m-l ^ if W > 1 

TO— 1 
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(34) 



So the evidence exists for m — > oo iff w < 1. For n = and n = 1 we have w = s <1 
(excluding s = 1), hence p(0) = 1 is correctly normalized, as claimed in the previous §, 
and p{x) = 1 is uniform as symmetry demands. For double points n — 2 the evidence 
p{x,x) is still finite iff w = s- ^^"^^2 ^ ^- latter is true for s < | and all a>0. For 
any n but a = l, tZ)<liffs< (n+l)2~", i.e. higher multi-point evidences only exist 
for exponentially small s. If we want w<l \/a>0, s<2^^" has to be even smaller. 
This follows from w being increasing in n and decreasing in a {w\s for a-^oo). 
To conclude: For every fixed a and s, multiplicity n of points must not be too high, 
but for any n one can choose a sufficiently large or s sufficiently small so that the 
n-multi-point evidence and hence the n*'* moments exist. 
For the mean of M[x) we get 

lE,[M(x)\D] = (l-^;)M, + ^-^M,o + («;^)iE.i[M(x)|L'] 

if x^ e r^i and similarly for x^ G F^q, where w = min{w,l}. This is not a self- 
consistency equation, but since w^^^ < 1, the linear recursion converges exponen- 
tially to the exact value. 

Multi-points for a = 1 and s = ^. We present some more results for the most 
interesting case a — 1 and s — -^, which we will also investigate numerically. 

We see that p[0) = l is correctly normalized, and p{x) = l is uniform as symmetry 
demands. For double points, the evidence p{x,x) — > | is still finite. It diverges linearly 
for triple points and exponentially for quadruple- and-higher points. So q{x) has 
finite prior mean E[q{x)] = 1 and variance Var[g(a;)] = | (Section 2). The skewness 
and kurtosis are infinite, indicating a heavy tail, as desired for a non-informative 
prior. 

Since p{D) = 1 and w = l for n = 1 arc the same as for the n = case, all prior 
n=0, m—i^cxD results remain valid forn= 1: g{x) — ^, E[N^^\x]=oo, P[N^^ = k\x]=ak, 
and E[h^^{x)\x] 1. 

For n = 2 we get g{x,x) = |, bk := -P[A^q; = k\x,x], bo = l-g{x,x) = |, 6^+1 = 
iEtoMfc-.= (iiJs'^v-), h{x) :=^-„6,a;^+i = i[l + 2Mx)/(a;)] = ^:p5^, and 
E[hq-^{x)\x,x] — >2. 

For n>3 we have g^{D) = l, bo = ^ bk = Q ^ P[N^,<oo\D]=0, E[h^^{x)\D] = 
m— >oo for x — x^. The tree at x has infinite height and singular distribution. 
For aU n we have p{D) > 1, g{D) > \, |D] ~2^(D) for m^cx). 

Not a double point, but also straightforward to compute is p(a^,y) = | — (l)'"*"^ if 
xgF^o and yeF^i, i.e. x and y separate at level l — i{z)^ consistent with p(a;,x) = | 
(/=oo). 

General D. We now consider general D. In order to compute p(-D) and other 
quantities, we recurse (13) down the tree until is a multi-point = : = 
{x',...,x') with e r^. We call the depth rux' :— £{z) at which this happens, the 
separation level. If we consider n^e Wq, this also includes the most important empty 
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Figure 4: Effectively recursed tree. Closed form solutions are used for intervals 
containing no or only a single (multi)point. 

and singleton case. In this way, the recursion always terminates. For instance, for 
r = [0,1), if e :— mm{\x'^ — x^ : ^ x^ with e D} is the shortest distance, 

then m^/ < log2- =: mo < oo, since £ > 0. At the separation level we can insert the 
derived formulas (for evidence, posterior, dimension, height, moments) for multi- 
points (Figure 4). Note, there is no approximation here. The procedure is exact, 
since we analytically computed the infinite recursion for multi-points. 

So we have devised a finite procedure for exactly computing all quantities of 
interest. In the worst case, wc have to recurse down to level mo for each data point, 
hence our procedure has computational complexity O(n-mo). For non-singular prior, 
the time is actually 0{n) with probability 1. So, inference in our mixture tree model 
is very fast. Polya trees for suitable Beta prior (should) admit similar algorithms. 

Multi-point divergences. Consider again s = | and a = l. Since E -J\N ^^^\D'^\ = oo 
recurses up, we have E\N^^\D\ = oo for all D. We now discuss possible diver- 
gences caused by true multi-points nj>2 in D. Ez[h^^^{x)\D'^]—oo recurses up to 
E[h^^{x)\D] = oo. Similarly, Pz[N^^^ <oo\D=]^0, recurses up to P[N^^ <oo\D]^0, 
since gz{Dz) = l along the path. 

There are interesting cases where p{D) = oo, but posteriors are finite, since in- 
finities cancel out. These are p{x\D) if x occurs at most once in D, and p{x,x,\D) if 
x^D, otherwise p{x,...,x\D) — oo. This is is very welcome: E[q{x)\D] <oo for all x, 
if D contains only singletons, which is true w.p.l for all regular q{). The posterior 
variance of q{x) is finite iff x^D, which could be better. Wc adapt recursion (13) 
by scaling p{D), p{D,x) and p{D,x,x) with the same constant Cm^O such that they 
stay finite, which works since D, {D,x) and {D,x,x) have the same triple, quadruple, 
... points. Choose I large enough (separation level) so that is a triple-or-higher 
point, and w.l.g. D^i is empty. Then the recursion (13) reduces to recursion (34). 
The 1+ in (13) gets swamped by Pzo(-D^) = oo and can be dropped. The remain- 
ing recursion is just a multiplication with w, which allows us to rescale pz{D^) to 
Pz{D^) = w~' (cf. (34)). We return a fiag in the recursion indicating that the ^+ 
shall be dropped along the path back to the root, since the true original Pz{D^) 
was infinite and would have swamped them all. We compute p{D), p{D,x), and 
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p{D,x,x) in this way to arrive at finite posteriors (not forgetting that the numbers 
we get for the evidences themselves are fictitious). 

A heuristic way of regularizing our model to yield always finite results could be 
to cut the recursion short at the separation level by definition, and then assign some 
regular (e.g. uniform) distribution to this leaf. Since the most important quantities 
are finite anyway, we refrain from such a data-dependent non-Bayesian hack. Better 
is to assign a smaller prior weight s < | to a split; then more (higher) moments 
become finite. 

Consistency (n — > oo). What remains to be shown is posterior consistency for 
m = oo similarly to the m<oo case. We will show weak consistency in the sense that 
p{{qz}\D) concentrates around {Qz}, where {q^} is a finite collection of branching 
probabilities. Consistency holds because the recursion ior p{{qz}\D) terminates at 
a depth independent of m (for sufficiently large m), so we are effectively in the 
finite tree case. The only difficulty is that the recursion involves Pz{Dz) which still 
has recursion depth (i.e. depends on) m. One solution could be to assume that all 
observations have some finite precision 2^™ and x&D'^Tx' G-D', where i{x') = m! 
and X e Fj./. This would make all involved recursions terminate at depth m' and 
hence all recursions and results for finite m apply (with m' instead of m). More 
interesting is to keep D and treat the m = oo case properly. We show that for 
n^oo, the evidence and the posterior converge uniformly in m<oo, which implies 
convergence also for m = oo. We sometimes indicate the m-dependence of by 
and define :=lim^^ooPr limit exists (possibly infinite), but mostly drop 

the superscript m<oo. 

Theorem 6 (Weak consistency for infinite trees) Let i{z)=l<m' <m+l<oo 
and Qz*' = {qzy '■ y € -2?™ ')■ Then evidence and p(oste)rior Pz{---) exist for m = oo 
and have the following properties for m<oo, where convergence '^^^^ holds w.p.l 
and is uniform in m: 

i) The marginal prior Pz{qz*') is independent ofm, 
a) Pz{Dz) "^^^ oo forus>0 and non-uniform qz{). 

Hi) Pz{qz*'\Dz) n Hlzy-Qzy)- 

weak , , 

yelBf-'- 

iv) Pz{Vy.,...,Vyu\Dz) qz{Vy.)-...-qz{Vyu) fory'eB* andkeN. 

{iv) implies weak convergence of p(a;|Z^) to q{x) in the sense that J f {x)p{x\D)dx ^ 
J f{x)q{x)dx for continuous functions /, by an argument similar to the proof of 
[Fab64, Thm.2.2]. Also the distribution function P[x<a\D]^ P[x<a\q]. 

Proof. We only have to consider m<oo. The m—oo case follows for (i) by definition 
and for (ii) — (iv) since convergence '"^^^ is uniform in m. 
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(i) The prior marginal is 



Pziqz*') = j PziQz*) n ^9^y (35) 



For I = m' we have PziQz*') = Pz{0) = 1 independent of m. For I < m' , inserting 
recursion (12) into (35) and using iBI"-'\iBl"'-' = {0,1} x (iBr"^'+^^iB"'"^'+^^) and 
(35) for zO and zl backwards, we get 

Pziqz*') =u Y[ S{qzy-l) + s-Bet&{qzo,qzi\(^,(^)Pzo{qzO*')Pzi{qzi*') 

So the recursion of ^^(g^*/) and its termination is independent of m, hence p^{qz*') 
is independent of m, hence p'^{qz*') exists and equals p'^{qz*') for m>m'. 

(ii) First note that Pz{Dz) in general depends on m. Further, Pz{Dz)>u>Q \/z. 
Assume first that qzQ^\- Then from (13) and Theorem 3(in) we get 

(T^\ ^ Pzo{Dzo)Pzl{Dzl) , SU^ n.^oo 
Pz{Dz) = U + S J— > ^ OO (36) 

Divergence of Pz{Dz) is uniform in m, since is independent m. Now consider 
the more general case of non- uniform g^O, i.e. 3y : q^yQ ^ \. Then (36) implies 
Pzy{Dzy) -^^oo. Further, for any z, if p^o^^oO; then p^—^oo, since Pzi^u>Q and 
Wn^ = 0{^/n^) by Theorem 3, and similarly if p^i oo. So by induction, p^y — oo 
uniformly in m implies Pz oo uniformly in m. Finally, p'^{Dz) exists, since 
p^{Dz) is independent m beyond the data separation level, as shown earlier. 

(Hi) Similarly to the marginal prior recursion in (^) one can show that the 
recursion for the marginal posterior Pz{qz*'\Dz) has the same form as the recursion 
(15) of ^2(^^*1-0^) with m replaced by m'. Contrary to the prior, the posterior still 
depends on m through Pz{Dz)- Choosing m beyond the data separation level does 
not help since it increases with n^. Nevertheless, the proof of (iii) is the same as for 
Theorem 5 with m^m! . Convergence is uniform in m, since convergence (25) and 
divergence of Pz{Dz) are uniform. Finally, p'^{qz*'\Dz) exists, since its recursion is 
finite (terminates at m') and p^ {D z) exists. 

(iv) Choose m'>max{£(y^),...,£(j/'^)}. Then 

Pz{Vyi, ...,Vyk\Dz) = j qz{^yi) ■ ■■■ ■ qzO^yk) ' Pz{qz*'\D z)dqz*' 
exists, and {iv) now follows from {iii) and qz{Ty)=qy-^^- ...■qy^,^^^^y □ 



7 The Algorithm 

What it computes. In the last two sections we derived all necessary formulas 
for making inferences with our tree model. Collecting pieces together we get the 
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exact algorithm for infinite tree mixtures below. It computes the evidence p{D), the 
expected tree height E[h^^{x)\D] at x, the average expected tree height E[h^jD], 
and the model dimension distribution P[N^_^\D]. It also returns the number of 
recursive function calls, i.e. the size of the explicitly generated tree. The size is 
proportional to n for regular distributions q. 

The BayesTree algorithm (in pseudo C code) takes arguments {DW,n,x,N); data 
array D[0..n — 1] G [0,1)'^, a point xElR, and an integer N. It returns (p,/t,/t,p[],r); 
the logarithmic data evidence p=lnp(D), the expected tree height h^E[h^^{x)\D] at 
X, the average expected tree height h=E[hg-jD], the model dimension distribution 
p[O..N — 1]=P[N^^ = ..\D], and the number of recursive function calls r i.e. the size 
of the generated tree, s, u and a are the global model parameters. Computation 
time is about N'^nXogn nano-seconds on a IGHz P4 laptop. 

BayesTree(i:)[] ,n,a:;, A/") 

[ if (n<l and (n==0 or D[Q]==x or a;^[0,l))) 
[ if (.tg[0,1)) h = s/u; else /i = 0; 

h = s/u] p = ln(l); r = l; 
[ for(A; = 0,..,7V-l) p[k] = ak; /* see (33) */ 

else 

[ no = ni = 0; 
for(i = 0,..,n — 1) 

[ if (D[^]<i) then[Do[no]=2D\i]; no = no + l;] 
[ else [L'i[ni] = 2L)[i]-l; ni = ni + l;] 

{po,ho,ho,Po[],ro)=BayesTree{Do[],no,2x,N-l); 
{PiM ,hi ,pi [] ,ri) =BayesTree(i:)i [] ,ni ,2x-l,N-l); 

t^Po+Pi-lnw{no,ni); /* see (14) */ 

if (i<100) thenp = ln(ii+s-exp(i)); 

else p—t+lii{s); 
g^l-u-ex.p(-p); 

if (xe[0,l)) then h^g-{l+ho+hi); else /i = 0; 

^ n+2a " n+2a 

p[0] = !-(?; 

for(A: = 0,.,iV-l) p{M\^g^\^M\-p^{k-i\- 
[ r = l+ro+ri; 
[ return (p,/i,/i,p[],r); 

How algorithm BayesTree() works. Since evidence p(i^) and weight l/w^ 
can grow exponentially with n, we have to store and use their logarithms. So 
the algorithm returns p=lnp(D). In the n< \ branch, the closed form solutions 
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p=lnp(0) =ln(l), h^E[h^^{x)\0 or x] — 1, h=E[hg'^\D] — 1, and p[k] —ak have been 
used to truncate the recursion. If D = {x^) ^ x, we have to recurse further until 
X falls in an empty interval. In this case or if n > 1 we partition D into points 
left and right of |. Then we rescale the points to [0,1) and store them in Dq and 
Di, respectively. Array D could have been reused (like in quick sort) without allo- 
cating two new arrays. Then, algorithm BayesTree() is recursively called for each 
partition. The results are combined according to the recursions derived in Section 
2. \nw can be computed from (14) via lnn\ = J2k=i^^k. (Practically, pre-tabulating 
Gk or n\ does not improve overall performance). For computing p we need to use 
ln(|(l + e*))=i — ln2 to machine precision for large t in order to avoid numerical 
overflow. 

Remcirks. Strictly speaking, the algorithm has runtime O(nlogn), since the sorting 
effectively runs once through all data at each level. If we assume that the data are 
presorted or the counts are given, then the algorithm is 0{n) [Hut07]. 

We have not presented the part handling multi-points. Given the formulas in 
Section 6, this is easy. The complete C code is available from [Hut07]. 

Note that x passed to BayesTree() is not and cannot be used to compute p{x\D). 
For this, one has to call BayesTree() twice, with D and {D,x), respectively. The 
quadratic order in is due to the convolution, which could be reduced to 0{N\ogN) 
by transforming it to a scalar product in Fourier space with FFT. 

Multiply calling BayesTree(), e.g. for computing the predictive density function 
p{x\D) on a fine x-grid, is inefficient. But it is easy to see that if we once pre-compute 
the evidence p^iD^) for all z up to the separation level in time 0{n), we can compute 
"local" quantities like p{x\D) at x in time O(logn). This is because only the branch 
containing x needs to be recursed, the other branch is immediately available, since 
it involves the already pre-computed evidence only. The predictive density p{x\D) = 
E[q{x)\D] and higher moments, the distribution function P[x<a\D], updating D by 
adding or removing one data item, and most other local quantities can be computed 
in time O(logn) by such a linear recursion. 

A good way of checking correctness of the implementation and of the derived 
formulas, is to force some minimal recursion depth m'. The results must be inde- 
pendent of m', since the closed-form speedups are exact and applicable anywhere 
beyond the separation level. 

8 Numerical Examples 

Graphs and examples. To get further insight into the behavior of our model, we 
numerically investigated some example distributions q{). We have chosen elementary 
functions, which can be regarded as prototypes for more realistic functions. They 
include the Beta, linear, a singular, and piecewise constant distributions with finite 
and infinite Bayes trees. These examples on [0,1) also shed hght on the other spaces 
discussed in Section 2, since they are isomorphic. The posteriors, model dimensions. 
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Figure 5: BayesTree() results for the Beta(3,6)ocx^(l — x)^ distribution, prototype 
for a smooth distribution. 
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Figure 6: BayesTree() results for the Singular distribution q{x) = 2/y/l — x, proto- 
type for a proper singular distribution. 
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O 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 

Figure 7: BayesTree() results for the Linear distribution q{x) = 2x, prototype for a 
continuous function. 




O 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 

Figure 8: BayesTree() results for the Jump-at-1/2 distribution q{x) = 9/5 for 
X < 1/2 and q{x) = 1/5 for x > 1/2, prototype for a piecewise constant function 
with a finite Bayes tree. 
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Figure 9: BayesTree() results for the Jump-at-1/3 distribution q{x) = 2/3 for 
X < 1/3 and q{x) = 1/3 for x > 1/3, prototype for a piecewise constant function 
with an infinite Bayes tree. 

tree heights, and variances are plotted in Figures 5-9 for random samples D of sizes 
n = 10°,...,10^. We first discuss observations common to all sampling distributions, 
thereafter specific aspects. All experiments were performed with split probability 
s = | and uniform distribution a = l. 

General observations. The posteriors p{x\D) clearly converge for n^oo to the 
true distribution g(), accompanied by a (necessary) moderate growth of the effective 
dimension (except for Jump-at-1/2). For n = 10 we show the data points. It is visible 
how each data point pulls the posterior up, as it should be ("one sample seldom 
comes alone"). 

Compare this to an empirical bin model with bins. Since each bin con- 
tains 0{n/N) data points, the frequency estimate nbin/n of q{bin) has accuracy 
N /n). The minimal error when approximating a continuous function by a 
piecewise constant function with bin size 1/iV is 0(1/A^), so the estimate has total 
error max{0(^iV/n),0(l/A'^)} with minimum 0{n-^/-') aX'^ N = n^/'^. This is nicely 
consistent with our model dimension. Look at the maxima of dimension distribution 
P[A^|Z)] or count the number of significant jumps in the posterior p{x\D). 

The figures also show that the posterior variances Var[g(x)|D] converge to zero 
for n— i>oo, but diverge when x tends to a point in consistent with the theoretical 

^Sometimes heuristic N = ^Jn is proposed, which makes no sense. 



28 



analysis of multi-points in Section 6. The expected tree height E[h{x)\D] at x 
correctly reflects the local needs for (non)splits. 

Specific observations. Beta: The Beta distribution Beta,{x\a,P)(xx°'~^{l—x)^~^ 
is prototypical for a smooth unimodal distribution. Apart from local jitter, the 
simplest model consistent with data size n=10 is essentially a Jump-at-1/2 function 
(see below). The tree height slowly increases with n with a dip around the "flat" 
maximum of the Beta, since a constant approximation works well there. 

Singular: We used the distribution q{x) — 2/^/1— x as a prototype for a proper 
singular distribution. The tree height is necessarily larger near the singularity at 
x^l. 

Linear: Once continuously differentiable functions are locally linear, so the linear 
distribution q{x) = 2x serves as a prototype for them. The better approximation of 
p{x\D) near versus near 1, accompanied by a higher tree, is remarkable. First, 
there arc fewer data points near to warrant this, and second, the region is less 
interesting, since more samples are at 1. So we expected quite the opposite behavior. 
We currently have no explanation for this phenomenon. 

Jump-at-1/2: Also illustrative are distributions with finite Bayes tree, i.e. piece- 
wise constant functions with discontinuities only at binary fractions. We consider 
the prototype that jumps at a; = 1/2. All quantities converge rapidly. We see that 
model dimension and tree height stay finite in this they should. Both con- 

verge to the minimal consistent value 1. The variance in the left and right half of 
[0,1) is roughly proportional to q therein. 

Jump-at-1/3: A jump at a non-binary fraction cannot be modeled with a finite 
tree. Convergence is slower than for Jump-at-1/2, but faster than for the other 
examples, which makes sense since only one branch of the tree has to grow to infinity. 
This shows up in a slower increase of dimension, a converging height function with 
singularity at 1/3, and a narrowing spike in the variance. 

9 Discussion 

We presented a Bayesian model on infinite trees, where we split a node into two 
subtrees with some probability, and assigned a Beta distributed probability to each 
subtree. We were primarily interested in the case of zero prior knowledge. In this 
case, scale invariance and symmetry should be preserved. Scale invariance requires 
the parameters to be the same in each node and symmetry requires a symmetric 
Beta, leaving one splitting probability s and one Beta parameter (5 as adjustable 
parameters. We devised closed form expressions for various inferential quantities of 
interest at the data separation level, which led to an exact algorithm with runtime 
essentially linear in the data size. 

We introduced and studied this two-parameter tree-model class. The most in- 
teresting case of splitting probability s = \ and uniform prior over subtrees (5 — 1 
has been studied in more detail. The theoretical and numerical model behavior was 
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very reasonable, e.g. consistency (no underfitting) and low finite effective dimension 
(no overfitting) . Higher moments can be made finite by smaller s or larger f3. 

There are various natural generalizations of our model. The splitting proba- 
bility s and Beta parameter /3 could be made dependent on the node of the tree, 
which allows incorporating prior knowledge, k-ary trees could be allowed with Beta 
generalized to Dirichlet distributions. Non-symmetric partitions are straightforward 
to implement by replacing all S{qz — ^) with ^(g^ — |r2|/|r2j.;_J), and possibly using 
non-symmetric Betas. The expected entropy can also be computed by allowing frac- 
tional counts riz and noting that xlnx = -^x°'\a=i [Hut02, HZ05]. A sort of maximum 
a posteriori (MAP) tree skeleton can also easily be read off from (13). A node 
in the MAP-hke tree is a leaf iff s^^oCd^oW^ ^ challenge is to generahze the 
model from piecewise constant to piecewise linear continuous functions, at least for 
r = [0,l). Independence of subtrees no longer holds, which was key in our analysis. 

If r is not already a tree or binary string, but an interval, a major problem 
of Polya trees and our tree model are partition artifacts in the estimated density. 
Numerically but unlikely analytically it is possible to average over boundary loca- 
tions like in [PRLW03] and smooth out discontinuities. Interestingly for flat bin 
estimation, analytical averaging is possible via dynamic programming [EF05]. 
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